Objective

The world indicators dataset compares different countries based on selected attributes. 1. To use K-means and hierarchical clustering methods to group similar countries together 2. To use Internal validation metrics such as Silhouette co-efficient to report the cluster quality 3. Tune the hyperparameters to report the best clustering solution. 4. To give a detailed list of all the groups and the countries included within each group 5. To generate few plots that would explain the characteristics of each cluster

Reading the dataset
df <- read_csv("World Indicators.csv")
Reading the structure of the dataset
str(df)
## spec_tbl_df [208 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Birth Rate            : num [1:208] 0.025 0.046 0.037 0.024 0.042 0.045 0.038 0.035 0.047 0.036 ...
##  $ Business Tax Rate     : chr [1:208] "72.00%" "52.10%" "65.90%" "19.50%" ...
##  $ Days to Start Business: num [1:208] 25 66 29 60 13 13 15 22 55 22 ...
##  $ Energy Usage          : num [1:208] 41852 13576 3761 2215 NA ...
##  $ GDP                   : chr [1:208] "$199,070,864,638" "$104,115,863,405" "$7,294,900,431" "$15,292,424,757" ...
##  $ Health Exp % GDP      : num [1:208] 0.044 0.034 0.045 0.052 0.064 0.09 0.054 0.039 0.028 0.036 ...
##  $ Health Exp/Capita     : chr [1:208] "$233" "$178" "$34" "$404" ...
##  $ Hours to do Tax       : num [1:208] 451 282 270 152 270 274 654 504 732 100 ...
##  $ Infant Mortality Rate : num [1:208] 0.023 0.107 0.06 0.039 0.068 0.059 0.064 0.1 0.092 0.061 ...
##  $ Internet Usage        : num [1:208] 0.1 0.1 0 0.1 0 0 0.1 0 0 0.1 ...
##  $ Lending Interest      : num [1:208] 0.08 0.188 NA 0.11 NA 0.132 NA NA NA 0.105 ...
##  $ Life Expectancy Female: num [1:208] 72 53 60 46 56 55 55 51 51 62 ...
##  $ Life Expectancy Male  : num [1:208] 69 50 58 47 55 51 53 47 49 59 ...
##  $ Mobile Phone Usage    : num [1:208] 0.9 0.5 0.8 1.5 0.5 0.2 0.5 0.2 0.3 0.3 ...
##  $ Population 0-14       : num [1:208] 0.272 0.477 0.432 0.34 0.458 0.44 0.432 0.404 0.487 0.422 ...
##  $ Population 15-64      : num [1:208] 0.681 0.499 0.539 0.625 0.517 0.535 0.535 0.558 0.488 0.549 ...
##  $ Population 65+        : num [1:208] 0.047 0.024 0.029 0.035 0.025 0.025 0.032 0.039 0.025 0.029 ...
##  $ Population Urban      : num [1:208] 0.682 0.409 0.423 0.565 0.265 0.109 0.521 0.39 0.22 0.28 ...
##  $ Region                : chr [1:208] "Africa" "Africa" "Africa" "Africa" ...
##  $ Country               : chr [1:208] "Algeria" "Angola" "Benin" "Botswana" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Birth Rate` = col_double(),
##   ..   `Business Tax Rate` = col_character(),
##   ..   `Days to Start Business` = col_double(),
##   ..   `Energy Usage` = col_double(),
##   ..   GDP = col_character(),
##   ..   `Health Exp % GDP` = col_double(),
##   ..   `Health Exp/Capita` = col_character(),
##   ..   `Hours to do Tax` = col_double(),
##   ..   `Infant Mortality Rate` = col_double(),
##   ..   `Internet Usage` = col_double(),
##   ..   `Lending Interest` = col_double(),
##   ..   `Life Expectancy Female` = col_double(),
##   ..   `Life Expectancy Male` = col_double(),
##   ..   `Mobile Phone Usage` = col_double(),
##   ..   `Population 0-14` = col_double(),
##   ..   `Population 15-64` = col_double(),
##   ..   `Population 65+` = col_double(),
##   ..   `Population Urban` = col_double(),
##   ..   Region = col_character(),
##   ..   Country = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Checking for NA values
sort(colSums(is.na(df)))
##                 Region                Country       Population Urban 
##                      0                      0                      2 
##             Birth Rate         Internet Usage Life Expectancy Female 
##                      9                      9                     11 
##   Life Expectancy Male     Mobile Phone Usage        Population 0-14 
##                     11                     12                     17 
##       Population 15-64         Population 65+                    GDP 
##                     17                     17                     20 
##  Infant Mortality Rate       Health Exp % GDP      Health Exp/Capita 
##                     20                     23                     23 
##      Business Tax Rate Days to Start Business        Hours to do Tax 
##                     27                     27                     28 
##           Energy Usage       Lending Interest 
##                     72                     77

Energy Usage and Lending Interest columns have around 35% of NA values. Mostly we will be dropping these columns during our analysis

Converting GDP and Health Exp % GDP to numeric form
df$GDP <- parse_number(df$GDP)
df$`Health Exp/Capita` <- parse_number(df$`Health Exp/Capita`)
df$`Business Tax Rate` <- parse_number(df$`Business Tax Rate`)
Analyzing the correlation between various features
df_num <- as.data.frame(select_if(df, is.numeric))
corval <- cor(df_num, use="na.or.complete")
corrplot(corval, type="upper")

In the correlation plot above, it can be observed that few columns correlate well with the other columns. This means that few columns (or features) are redundant and can be dropped from the dataframe. Columns such as Birth Rate, Infant Mortality Rate, Internet Usage, Life Expectancy Female, Life Expectancy Male, Population 0-14, Population 15-64 correlate highly with each other. Another such group is Energy Usage and GDP as well as GDP, Health Exp % GDP, Health Exp/Capita. Instead of dropping all of these columns, we have to retain one column to represent these groups. For the last two groups, it is easy to determine the set of columns that we have to retain; we are going to retain GDP from both the groups. However, for the first group, it is difficult to externally determine the representing column. We will retain the column that has the highest correlation value with all other columns. To find out which column has the highest correlation, let’s use the “corval” matrix.

Determining the column with the highest correlation value among the above listed columns
colvec <- c("Birth Rate", "Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
           "Life Expectancy Male", "Population 0-14", "Population 15-64")
cval1 <- corval[colvec, colvec]
sort(colMeans(abs(cval1)), decreasing = TRUE)[1]
## Birth Rate 
##  0.8443748

It can be concluded that Birth Rate is the column that we want to retain among these 7 columns.

Dropping the other 9 columns
colvec2 <- c("Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
           "Life Expectancy Male", "Population 0-14", "Population 15-64", "Energy Usage", 
           "Health Exp % GDP", "Health Exp/Capita")

df2 <- df[, !(names(df) %in% colvec2)]
df_num <- df_num[, !(names(df_num) %in% colvec2)]
Check NA values
sort(colSums(is.na(df2)), decreasing=TRUE)
##       Lending Interest        Hours to do Tax      Business Tax Rate 
##                     77                     28                     27 
## Days to Start Business                    GDP         Population 65+ 
##                     27                     20                     17 
##     Mobile Phone Usage             Birth Rate       Population Urban 
##                     12                      9                      2 
##                 Region                Country 
##                      0                      0
Lets go ahead and drop the columns which has more than 10% (>20 NA values) data missing. We are doing so because if we replace these values with either mean or median or zero; then we will introduce bias and these records with replaced values will have similar features and eventually form a cluster, which we would want to avoid. For the remaining features, let’s replace the NA values with the mean in the numerical columns.
coldrop <- c("Lending Interest", "Hours to do Tax", "Business Tax Rate", "Days to Start Business")
df2 <- df2[, !names(df2) %in% coldrop]
df_num <- df_num[, !names(df_num) %in% coldrop]

colmean <- as.list(colMeans(df_num, na.rm=TRUE))
df_num <- replace_na(df_num, colmean)

df2 <- replace_na(df2, colmean)

We have the final set of 5 columns (or features) that we are going to perform clustering. ##### Let us feature scale each of the columns to obtain best clustering result. But before that, let us boxplot each of these features quickly to see if there are any outliers

for (i in (1:5)) {
  boxplot(df2[i], xlab=names(df2[i]))
}

There are many outliers especially for the column GDP. Hence, we will have to scale the values using Standard Scaler function defined by –

Standard Scaler

df2[1:5] <- scale(df2[1:5], center=TRUE, scale=TRUE)

Now we are ready for clustering. However, we need to find the optimal number of clusters first. ##### Let us implement Elbow method and Silhouette method to get the number of clusters.

#kmeans(df2[1:5], )
fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "wss")

fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "silhouette")

Both Elbow method (WSS) and Silhouette method suggests that 2 is the optimal number of clusters.

However, let’s perform kmeans clustering with centers from 2 to 10 in a loop and analyze each. We will use Dunn Index and Silhouette Co-efficient as two internal analysis metrics to evaluate.

Dunn Index

Silhouette Average Width Co-efficient

centersloop=2:10
dunns <- list()
sse <- list()
ssb <- list()
s1 <- list()
d1 <- dist(df2[1:5], method = "euclidean")
for (centers in centersloop) {
  k1 <- kmeans(df2[1:5], centers = centers, iter.max=100000, nstart=200)
  sse <- append(sse, k1$tot.withinss) #Sum of Squared Errors within cluster
  ssb <- append(ssb, k1$totss)        #Group Sum of Squares 
  dunns <- append(dunns, dunn(clusters=k1$cluster, Data=df2[1:5]))  #Dunn index
  s1 <- append(s1, summary(silhouette(k1$cluster, d1))$avg.width)   #Silhouette Average Width
}
Implementing Hierarchical Clustering
clusters <- 2:10
d1 <- dist(df2[1:5], method = "euclidean")
hc1 <- hclust(d1, method = "complete")

dunns2 <- list()
s2 <- list()

plot(hc1, cex = 0.6, hang = -1)

for (c in clusters) {
  hclust1 <- cutree(hc1, k = c)
  dunns2 <- append(dunns2, dunn(clusters=hclust1, Data=df2[1:5]))  #Dunn index
  s2 <- append(s2, summary(silhouette(hclust1, d1))$avg.width)   #Silhouette Average Width
}
Comparing the Silhouette scores of K means and Hierarchical clustering to choose the optimal number of clusters and the algorithm
plot(x=2:10, y=s1, type="b", xlab="Number of Clusters", ylab="Silhouette co-efficient of K means")

plot(x=2:10, y=s2, type="b", xlab="Number of Clusters", 
     ylab="Silhouette co-efficient of Hierarchical clustering")

#Hierarchical clustering with 2 centers gives erroneous results and we are ignoring it deliberately

The above plots indicate that we can get the best clustering solution by implementing kmeans with center equals 3. ##### Implementing K means with 3 centers.

k1 <- kmeans(df2[1:5], centers = 3, iter.max=100000, nstart=20)
print("Number of countries in each cluster")
## [1] "Number of countries in each cluster"
print(table(k1$cluster))
## 
##   1   2   3 
##   3  79 126
c1 <- subset(df, k1$cluster == 1)
c2 <- subset(df, k1$cluster == 2)
c3 <- subset(df, k1$cluster == 3)

print("Countries in the First Cluster")
## [1] "Countries in the First Cluster"
print(c1[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 3 × 5
##   Country       Region           GDP `Life Expectancy Female` `Birth Rate`
##   <chr>         <chr>          <dbl>                    <dbl>        <dbl>
## 1 China         Asia         7.32e12                       76        0.012
## 2 Japan         Asia         5.91e12                       86        0.008
## 3 United States The Americas 1.55e13                       81        0.013
print("Countries in the Second Cluster")
## [1] "Countries in the Second Cluster"
print(c2[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 79 × 5
##    Country                  Region          GDP `Life Expectancy F… `Birth Rate`
##    <chr>                    <chr>         <dbl>               <dbl>        <dbl>
##  1 Angola                   Africa 104115863405                  53        0.046
##  2 Benin                    Africa   7294900431                  60        0.037
##  3 Burkina Faso             Africa  10395757480                  56        0.042
##  4 Burundi                  Africa   2355652064                  55        0.045
##  5 Cameroon                 Africa  25486923059                  55        0.038
##  6 Central African Republic Africa   2195599491                  51        0.035
##  7 Chad                     Africa  12156380062                  51        0.047
##  8 Comoros                  Africa    610372697                  62        0.036
##  9 Congo, Dem. Rep.         Africa  23831631365                  51        0.044
## 10 Congo, Rep.              Africa  14425606793                  59        0.038
## # … with 69 more rows
print("Countries in the Third Cluster")
## [1] "Countries in the Third Cluster"
print(c3[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 126 × 5
##    Country      Region          GDP `Life Expectancy Female` `Birth Rate`
##    <chr>        <chr>         <dbl>                    <dbl>        <dbl>
##  1 Algeria      Africa 199070864638                       72        0.025
##  2 Botswana     Africa  15292424757                       46        0.024
##  3 Gabon        Africa  18796191833                       64        0.032
##  4 Libya        Africa  34699395524                       77        0.022
##  5 Mauritius    Africa  11252405860                       77        0.011
##  6 Morocco      Africa  99211339029                       72        0.022
##  7 Seychelles   Africa   1059593023                       78        0.019
##  8 South Africa Africa 403894316555                       57        0.021
##  9 Tunisia      Africa  45951129422                       77        0.019
## 10 Armenia      Asia    10142342770                       78        0.014
## # … with 116 more rows
Lets understand the details of these 3 clusters by plotting a heatmap of the centers of these clusters.
print(k1$centers)
##   Birth Rate         GDP Mobile Phone Usage Population 65+ Population Urban
## 1 -1.0261167  6.93525546         -0.2195440      1.4622538        0.6809755
## 2  1.0153847 -0.22878806         -0.8812683     -0.8066485       -0.8577265
## 3 -0.6121988 -0.02167865          0.5577685      0.4709402        0.5215672
heatmap(k1$centers, Rowv=NULL, Colv=NULL, xlab="Features", ylab="Clusters")

The above heatmap can give a gist of the characteristics of each of the cluster created by the algorithm

Scatter plot between Birth Rate and GDP and the clusters
df$Clusters <- k1$cluster

fig1 <- plot_ly(data = df, 
          x=df$`Birth Rate`, y=df$GDP, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Birth Rate vs GDP",
      xaxis = list(title = 'Birth Rate'),
      yaxis = list(title = 'GDP', type="log")
    )

fig1
## Warning: Ignoring 24 observations
# ggplot(df, aes(x=df$`Birth Rate`, y=GDP, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_y_continuous(trans="log", labels=dollar)+
#   labs(x="Birth Rate", y="GDP", color="Clusters", title="Birth Rate vs GDP")
  #theme(legend.title = "Clusters")

Here, it can be observed that Birth Rate and GDP are inversely related. And the clusters divided can be observed to be as - (The order might differ with each successive runs) 1. Lower GDP and Higher Birth Rate 2. Higher GDP and Lower Birth Rate 3. Medium GDP and Medium Birth Rate

Scatter plot between Health Exp/Capita vs Infant Mortality Rate abd the clusters
fig1 <- plot_ly(data = df, 
          x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Health Exp/Capita vs Infant Mortality Rate",
      xaxis = list(type="log", title = 'Health Exp/Capita'),
      yaxis = list(title = 'Infant Mortality Rate')
    )

fig1
## Warning: Ignoring 23 observations
# ggplot(df, aes(x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_x_continuous(trans="log", labels=dollar)+
#   labs(x="Health Exp/Capita", y="Infant Mortality Rate", color="Clusters")
#   #theme(legend.title = "Clusters")

As one would expect, Infant mortality rate and Health Expenditure Per capita is inversely proportional and again we observe 3 clusters - 1. High IMR and Low Health Expenditure per capita 2. Moderate IMR and Moderate Health Expenditure per capita 3. Low IMR and High Health Expenditure per capita

Scatter plot between Population Urban vs GDP and the clusters
#df$Clusters <- k1$cluster

fig1 <- plot_ly(data = df, 
          x=df$`Population Urban`, y=df$GDP, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Population Urban vs GDP",
      xaxis = list(title = 'Population Urban'),
      yaxis = list(type="log", title = 'GDP')
    )

fig1
## Warning: Ignoring 21 observations
# ggplot(df, aes(x=df$`Population Urban`, y=GDP, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_y_continuous(trans="log", labels=dollar)+
#   labs(x="Population Urban", y="GDP", color="Clusters")
#   #theme(legend.title = "Clusters")

A positive trend can be observed between population living in the urban and GDP. The clusters divided can be observed as - 1. Lower population in the urban and Lower GDP 2. Higher population in the urban and Higher GDP 3. Scattered cluster

Conclusion

The clusters divided make sense even in the real world where the top performing countries like USA, China, Japan are in one cluster and then the moderate performing (majority of the European countries) and the poorly performing countries (majority of African countries). As the dataset contains the features that are mostly related to economy and healthy lifestyle, we get these clusters from the algorithm. If we wish to see these based on other factors (natural resources availability, human resources etc.), then we should add more features.